;;; -*- Mode: TEXT -*-
;;; File: AutoClass:doc;ac2-vs-ac3.text

;;;------------------------------------------------------------------------;;;
;;; AUTOCLASS 3.0 Released 5/11/90 contact: Taylor@pluto.arc.nasa.gov      ;;;
;;; by P. Cheeseman, J. Stutz, R. Hanson, W. Taylor                        ;;;
;;; NASA Ames Research Center, MS 244-17, Moffett Field, CA 94035          ;;;
;;;                                                                        ;;;
;;; Copyright (C) 1990 Research Institute for Advanced Computer Science.   ;;;
;;; All rights reserved. The RIACS Software Policy contains specific       ;;;
;;; terms and conditions on the use of this software, and must be          ;;;
;;; distributed with any copies. THIS FILE MAY BE REDISTRIBUTED. This      ;;;
;;; copyright and notice must be preserved in all copies made of this file.;;;
;;;------------------------------------------------------------------------;;;
;;; added 6/06/90 for 3.0.2

AutoClass-2 was built around the simplest useful class probability
function. Basically this separately modeled each discrete attribute
with a multinomial distribution and each real attribute with a normal
distribution. Attribute interactions were assumed to be conditional on
the class alone, thus ignoring the possibility of joint discrete and
covariant real distributions. In both attribute types, missing values
were allowed for by conditioning the basic distribution on a binomial
distribution over the meta-values of `known' and `unknown'. Attributes
could also be ignored.

AutoClass-3.0 uses essentially identical class probability functions.
The main difference between the two is in the way the probability
function is implemented. The AutoClass-2 function is built into the
internal search representation. In AutoClass-3 a probability function
is implemented as a structure called a model. The model holds all of
the functions and data specific to the application of the probability
function to a particular data set. It is built at runtime from a user
supplied model specification. The specification lists the types of the
model function terms and the attributes to which they apply.
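
As a rough illustration of the ideas above (a model assembled at runtime
from a specification, with one independent term per attribute and missing
values handled through a known/unknown probability), here is a minimal
sketch in Python. It is not AutoClass code: the names `discrete_term`,
`real_term`, `build_model`, `class_log_prob`, and the specification layout
are all hypothetical, and the parameters are fixed by hand rather than
estimated from data.

```python
import math

def discrete_term(counts, p_unknown):
    """Multinomial term for one discrete attribute (hypothetical helper).

    counts maps each symbol to a weight; None stands for a missing value.
    p_unknown must lie strictly between 0 and 1.
    """
    total = sum(counts.values())
    probs = {value: c / total for value, c in counts.items()}

    def log_prob(value):
        if value is None:
            return math.log(p_unknown)
        # Probability of being known, times the multinomial probability.
        return math.log(1.0 - p_unknown) + math.log(probs[value])

    return log_prob

def real_term(mean, sigma, p_unknown):
    """Normal term for one real attribute (hypothetical helper)."""
    def log_prob(value):
        if value is None:
            return math.log(p_unknown)
        z = (value - mean) / sigma
        log_density = -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))
        return math.log(1.0 - p_unknown) + log_density

    return log_prob

# Assumed registry of term types that a specification may name.
TERM_BUILDERS = {"discrete": discrete_term, "real": real_term}

def build_model(spec):
    """Build a model from a list of (term_type, attribute_index, params)."""
    return [(attr, TERM_BUILDERS[kind](**params)) for kind, attr, params in spec]

def class_log_prob(model, datum):
    """Sum the term log-probabilities: attributes are independent given the class."""
    return sum(log_prob(datum[attr]) for attr, log_prob in model)

# Attribute 2 is not listed in the specification, so it is simply ignored.
spec = [
    ("discrete", 0, {"counts": {"red": 3, "blue": 1}, "p_unknown": 0.1}),
    ("real", 1, {"mean": 0.0, "sigma": 1.0, "p_unknown": 0.2}),
]
model = build_model(spec)
score = class_log_prob(model, ["red", 0.0, "ignored"])
```

Adding a new kind of term in this sketch means registering one more
builder; nothing else in the model code changes, which is the sort of
flexibility the text attributes to the model structure.
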
The model terms define the independent probability distributions
applicable to single or multiple attributes of appropriate type. A
model term is implemented as a set of functions and data structures
that are called from or copied into the model. Thus we obtain great
flexibility in defining specific probability models and in extending
the range of such models.

There has been a considerable increase in the flexibility of input
data formatting. Data vectors can be given in vector, list, or line
mode. Discrete attribute values need no longer be translated to a
zero-based integer sequence. Any set of symbols, including strings,
may be used. There is also provision for specifying output
translations.

There has been some increase in operating flexibility.
Classifications, databases and models are now implemented as
structures. Thus one can simultaneously work with multiple
classifications of single or multiple databases. There are also a
variety of initialization and search methods that have evolved from
our own experiments.

For most users, the most useful change is the standard search function
named AutoClass-Search. This has evolved from Robin Hanson's
experiments on search efficiency and estimation of the optimal
starting number of classes. Robin has found that by using the simplest
(and fastest) initialization and convergence methods while developing
an estimate of the optimal class number, one can get very good
classifications more quickly than by any other of our methods. This
has been implemented in the function AutoClass-Search, which offers:

    Automatic search for the best number of classes.
    Runtime reports that estimate the rate of progress.
    More flexible choices for what to save to disk, how often, etc.
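
To illustrate the input formatting change described earlier, the sketch
below shows the kind of translation between arbitrary symbols and a
zero-based integer sequence that users previously had to perform by hand.
It is a Python illustration only; the function name `build_translation`
is hypothetical, and how AutoClass-3 handles this internally is not
stated in this document.

```python
def build_translation(values):
    """Assign each distinct symbol a zero-based code, in first-seen order."""
    to_code = {}
    for value in values:
        if value not in to_code:
            to_code[value] = len(to_code)
    # Dicts keep insertion order, so position i holds the symbol with code i.
    from_code = list(to_code)
    return to_code, from_code

column = ["red", "blue", "red", "green"]
to_code, from_code = build_translation(column)

codes = [to_code[value] for value in column]   # input translation
restored = [from_code[code] for code in codes] # output translation
```
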